{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 23 - Confidence intervals for regression\n", "\n", "We will use the newborn data set from Lab 16, [babies.data](https://www.stat.berkeley.edu/~statlabs/data/babies.data). The data come from a sample of newborns born between 1960 and 1967 in a major hospital system in California.\n", "\n", "The columns are:
\n", "bwt: Birth weight in ounces (999 unknown)
\n", "gestation: Length of pregnancy in days (999 unknown)
\n", "parity: 0=first born, 9=unknown
\n", "age: mother's age in years (99 unknown)
\n", "height: mother's height in inches (99 unknown)
\n", "weight: mother's pre-pregnancy weight in pounds (999 unknown)
\n", "smoke: Smoking status of mother: 0=not now, 1=yes now, 9=unknown\n", "\n", "We will look at the relationship between birth weight and the number of gestational days, and the relationship between maternal age and birth weight.\n", "\n", "First, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "import statsmodels.formula.api as smf\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading and cleaning the data\n", "\n", "Read the data file into the dataframe `babies`. As in Lab 16, we need to add the parameter `sep = \"\\s+\"` since the columns are separated by whitespace instead of commas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display your `babies` dataframe below to check it was created properly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the `gestation` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the histogram? What's the explanation? (Hint: Read the description of the gestation column above.)\n", "\n", "If the number of gestational days is missing, 999 is used instead. We need to read in the dataset again, this time telling `read_csv()` that 999 represents an NaN value. Can you figure out how to do this? See Labs 18 and 22 where we specified a different NaN value for the Starbucks data."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Hint:\n", "Add the parameter na_values = \"999\"\n", "
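As a sketch of how the hint behaves, the cell below reads a tiny made-up table (not the real `babies.data`) from an in-memory string and counts the NaN values that `na_values` produces. The sample rows and the name `sample_text` are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up rows in the same whitespace-separated layout as the lab file
sample_text = '''bwt gestation age
120 284 27
113 999 33
128 279 28
'''

# Without na_values, 999 is read as an ordinary number
plain = pd.read_csv(io.StringIO(sample_text), sep='\\s+')

# With na_values, every 999 becomes NaN
cleaned = pd.read_csv(io.StringIO(sample_text), sep='\\s+', na_values='999')

print(plain['gestation'].max())            # 999
print(cleaned['gestation'].isna().sum())   # 1
```

Note that a single value passed to `na_values` applies to every column, not only to `gestation`.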
\n", "\n", "\n", "Drop any row with an NaN value (also done in Lab 18 and 22):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", "df = df.dropna(axis = 0)\n", "
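A small sketch of the pattern on a made-up dataframe (the values and the name `toy` are assumptions, not lab data). Comparing the row counts before and after shows how many rows were dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bwt': [120, 113, 128, 108],
    'gestation': [284, np.nan, 279, 282],
    'age': [27, 33, np.nan, 23],
})

rows_before = len(toy)
toy = toy.dropna(axis=0)   # drop every row that has at least one NaN
rows_after = len(toy)

print(rows_before, rows_after)  # 4 2
```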
\n", "\n", "Let's check the unknown values have been successfully dropped by plotting a histogram of the weight or gestation column again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationship between birth weight and gestational days.\n", "\n", "In Lab 16 we looked at the relationship between birth weight and the number of gestational days (how long the baby was in the womb). We will look at this relationship again using regression.\n", "\n", "First, we'll visualize the data. Use the Seaborn library (Lab 21) to plot a scatter plot with the regression line, where gestational days are on the x axis and birth weight (`bwt`) is on the y axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", "sns.regplot(x = \"x_axis_column_name\", y = \"y_axis_column_name\", data = df)\n", "
\n", "\n", "Looking at the graph, do you think there is a linear relationship between the number of gestation days and the birth weight?\n", "\n", "There appears to be a weak linear relationship between the gestation days and birth weight. Let's compute the correlation matrix to get the correlation between the two variables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", "df.corr()\n", "
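To read a single entry from the matrix, `.loc` can select it by row and column name. A minimal sketch on made-up numbers (the dataframe `toy` is an assumption, not lab data):

```python
import pandas as pd

toy = pd.DataFrame({
    'gestation': [270, 275, 280, 285, 290],
    'bwt': [110, 118, 115, 125, 130],
})

corr_matrix = toy.corr()                   # Pearson correlation for every pair of columns
r = corr_matrix.loc['gestation', 'bwt']    # one entry of the matrix

# The same number computed directly from the two columns
r_direct = toy['gestation'].corr(toy['bwt'])
print(round(r, 3), round(r_direct, 3))
```

The matrix is symmetric, so `corr_matrix.loc['bwt', 'gestation']` holds the same value.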
\n", "\n", "What's the correlation between gestational days and birth weight? Is it high, low, or in the middle?\n", "\n", "Next, let's calculate the slope of the regression line by fitting the linear model and then displaying the model summary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", "linear_model_variable = smf.ols(formula = \"dependent_var ~ independent_var\", data = df).fit()\n", "linear_model_variable.summary()\n", "
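The fitted model also exposes its coefficients through `params`, which can be handier than reading them off the summary table. A hedged sketch on made-up numbers (the dataframe `toy` is an assumption, not lab data); `np.polyfit` is used only as an independent check of the slope:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({
    'gestation': [270, 275, 280, 285, 290],
    'bwt': [110, 118, 115, 125, 130],
})

model = smf.ols(formula='bwt ~ gestation', data=toy).fit()

intercept = model.params['Intercept']     # predicted bwt at gestation = 0
slope_value = model.params['gestation']   # change in bwt per extra gestation day

# Independent check: a degree-1 least-squares fit returns [slope, intercept]
check_slope, check_intercept = np.polyfit(toy['gestation'], toy['bwt'], 1)
print(slope_value, check_slope)
```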
\n", "\n", "What's the slope of the regression line?\n", "\n", "Remember that this relationship is based only on a sample of data. A different sample would give a different regression line and slope. To understand how much the slope could change depending on the sample, we will compute the 95% confidence interval for the slope. \n", "\n", "First, we will create a function for computing the regression and extracting the slope from it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# df is the dataframe\n", "# x is the name of the column for the independent variable\n", "# y is the name of the column for the dependent variable\n", "def slope(df, x, y):\n", "    formula_string = y + \" ~ \" + x # create a string containing the formula in advance\n", "    lm = smf.ols(formula = formula_string, data = df).fit()\n", "    # params holds the intercept at position 0 and the slope at position 1\n", "    return lm.params.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try calling (running) this function on our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slope(babies,\"gestation\",\"bwt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Was the slope the same as your previous computation?\n", "\n", "To find the 95% confidence interval, we will:\n", "- create an empty list to store the slopes\n", "- take 1000 bootstrap samples the same size as the original data\n", "- for each sample, compute the slope of the regression line using our function and save it in our list\n", "\n", "The pseudo-code is:\n", "\n", "slopes = []\n", "loop 1000 times:\n", " take a bootstrap sample (take a sample with replacement the same size as the original data)\n", " compute the slope of the regression line of the sample\n", " add the slope to `slopes`\n", "\n", "\n", "Try writing the actual code below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], 
"source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "slopes = []\n", "for i in range(1000):\n", "    sample = babies.sample(len(babies), replace = True)  # same size as the original data\n", "    sample_slope = slope(sample, \"gestation\", \"bwt\")\n", "    slopes.append(sample_slope)\n", "
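The loop above can be tried end to end on simulated data. The sketch below is not the lab's answer: the data are randomly generated with a known slope of 0.5, and `np.polyfit` stands in for the lab's `slope` function so the cell runs without the data file. All names (`toy`, `fit_slope`, `boot_slopes`) are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(280, 16, size=200)
toy = pd.DataFrame({'x': x, 'y': 0.5 * x + rng.normal(0, 8, size=200)})

def fit_slope(df):
    # Degree-1 least-squares fit; the first coefficient is the slope
    return np.polyfit(df['x'], df['y'], 1)[0]

boot_slopes = []
for i in range(500):
    # Resample rows with replacement, same size as the original data
    resample = toy.sample(len(toy), replace=True, random_state=i)
    boot_slopes.append(fit_slope(resample))

# The middle 95% of the bootstrap slopes
lower = pd.Series(boot_slopes).quantile(0.025)
upper = pd.Series(boot_slopes).quantile(0.975)
print(lower, upper)
```

Passing a different `random_state` on each pass keeps the run reproducible while still drawing a new resample every time. In the lab itself the argument can be left out.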
\n", "\n", "Plot the histogram of the slopes of the samples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the mean of the slopes? How does this compare to the slope of the actual data?\n", "\n", "Were any of the sample slopes 0? Does this suggest a different sample of the data could have slope 0?\n", "\n", "Finally, let's compute the 95% confidence interval by computing the 0.025 quantile (2.5th percentile) and the 0.975 quantile (97.5th percentile). We can compute the 0.025 quantile with the code `pd.Series(slopes).quantile(0.025)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's your 95% confidence interval?\n", "\n", "It should be close to (0.39, 0.56).\n", "\n", "### Maternal Age and Birth Weight\n", "\n", "Let's look at whether there is a relationship between maternal age and birth weight. That is, can maternal age in any way predict the birth weight?\n", "\n", "Plot a histogram of the maternal ages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the histogram? Why do you think this is the case? (Hint: look at the descriptions of the columns at the top of the lab.)\n", "\n", "No, there wasn't a medical miracle of a 99-year-old mother giving birth! 99 is used in the maternal age column to indicate the age is unknown. We should remove the row with the missing age from our dataset. However, we can't add 99 to the list of missing values for every column, because a birth weight could be 99 oz (about 6.2 lbs). Therefore, we need to use the following code, which tells pandas which value marks missing data in each column."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "babies = pd.read_csv(\"../Data/babies.data\",sep = \"\\s+\", na_values = {\"gestation\":\"999\",\"age\":\"99\"})\n", "babies = babies.dropna(axis = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try plotting a histogram of the maternal ages again. Did we remove the outlier?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First plot a scatter plot and regression line of maternal age (x) vs. birth weight (y)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the slope look close to 0? If the slope is 0 it would indicate no relationship. However, for any particular sample of data, the slope is unlikely to be exactly 0. Therefore, we want to construct the 95% confidence interval for the slope and check if it contains 0.\n", "\n", "We will now compute the 95% confidence interval. \n", "\n", "First, compute 1000 bootstrap samples, saving the slope of the regression line of each sample." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the slopes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does it look like the 95% confidence interval for the slope will contain 0?\n", "\n", "Let's formally calculate the 95% confidence interval for the slope by computing the 0.025 and 0.975 quantiles:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the 95% confidence interval for the slope? Does it contain 0?\n", "\n", "If so, we must conclude that there is not enough evidence to show a relationship between maternal age and birth weight.\n", "\n", "### Challenges:\n", "- Is there a relationship between maternal height and birth weight? That is, can birth weight in any way be predicted from maternal height?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }